We present our CVPR 2025 tutorial on Recent Advances in Vision Foundation Models, a topic that has garnered significant attention from the computer vision community. The tutorial covers the most advanced directions in designing and developing vision foundation models, including state-of-the-art approaches and principles for (i) learning vision foundation models for multimodal understanding and generation, (ii) scaling test-time compute and enabling self-training so that foundation models can improve their own reasoning and perception, and (iii) building physical and virtual agents on top of vision foundation models that can take actions in robotic and virtual environments.
You are welcome to join our tutorial either in person or virtually via Zoom (log in to the CVPR 2025 portal to find the Zoom link).
Afternoon Session

| Time | Talk | Speaker |
| --- | --- | --- |
| 13:00 - 13:50 | Advancing Multimodal LLMs: From Seeing to Understanding and Acting [Slides] | Zhe Gan |
| 13:50 - 14:40 | Multimodal Reasoning for Visual-Centric Long-Horizon Tasks [Slides] | Zhengyuan Yang |
| 14:40 - 15:00 | Coffee Break & QA | |
| 15:00 - 15:50 | See. Think. Act. Training Multimodal Agents with Reinforcement Learning [Slides] | Linjie Li |
| 15:50 - 16:40 | Towards Multimodal AI Agent That Can See, Think and Act [Slides] | Jianwei Yang |
| 16:40 - 17:00 | Closing Remarks & QA | |
Contact the Organizing Committee: vlp-tutorial@googlegroups.com